Do multiple experimenters improve the reproducibility of animal studies?

The credibility of scientific research has been seriously questioned by the widely claimed “reproducibility crisis”. In light of this crisis, there is a growing awareness that the rigorous standardisation of experimental conditions may contribute to poor reproducibility of animal studies. Instead, systematic heterogenisation has been proposed as a tool to enhance reproducibility, but a real-life test across multiple independent laboratories is still pending. The aim of this study was therefore to test whether heterogenisation of experimental conditions by using multiple experimenters improves the reproducibility of research findings compared to standardised conditions with only one experimenter. To this end, we replicated the same animal experiment in 3 independent laboratories, each employing both a heterogenised and a standardised design. Whereas in the standardised design, all animals were tested by a single experimenter; in the heterogenised design, 3 different experimenters were involved in testing the animals. In contrast to our expectation, the inclusion of multiple experimenters in the heterogenised design did not improve the reproducibility of the results across the 3 laboratories. Interestingly, however, a variance component analysis indicated that the variation introduced by the different experimenters was not as high as the variation introduced by the laboratories, probably explaining why this heterogenisation strategy did not bring the anticipated success. Even more interestingly, for the majority of outcome measures, the remaining residual variation was identified as an important source of variance accounting for 41% (CI95 [34%, 49%]) to 72% (CI95 [58%, 88%]) of the observed total variance. Despite some uncertainty surrounding the estimated numbers, these findings argue for systematically including biological variation rather than eliminating it in animal studies and call for future research on effective improvement strategies.


Introduction
One core principle of science is the reproducibility of results, i.e., obtaining consistent results when replicating a study by collecting new data. However, over the past decade, the credibility of scientific results has been challenged by the fact that many replication studies failed to reproduce published results [1][2][3][4]. Based on such replication studies, the prevalence for irreproducible results was estimated to range between 50% and 90% and the majority of scientists in the life sciences are convinced that this "reproducibility crisis" needs to be addressed [1,5].
Failures to reproduce results of previous studies have been attributed to deficiencies in reporting standards, as well as to flaws in experimental design and statistical analysis [6][7][8][9][10]. To improve the situation, several guidelines such as the TOP [11], ARRIVE [12,13], and PRE-PARE [14] guidelines have been established to improve the planning, conduct, analysis, and reporting of studies. All of these attempts were already successful in increasing the overall quality of reported studies [15]. However, these strategies assume that the issue of irreproducibility can be comprehensively solved, provided that the experiment is planned and conducted with adequate expertise and that the methods are transparently reported in sufficient detail. With respect to animal research, one additional cause of irreproducibility is thereby often neglected or not adequately addressed. Every study involving animals is challenged by the fact that living organisms are highly responsive to their environment. This flexibility in the phenotype of an animal with a specific genotype towards different environmental cues is known as phenotypic plasticity [16].
Phenotypic plasticity leads to variation in results, even if all animals are genetically identical [17,18]. Plastic responses of an organism with a specific genotype towards its local environment (i.e., the laboratory environment) may result in remarkably different results across replicate studies [19]. This was impressively demonstrated by Crabbe and colleagues [20]. Despite extensive standardisation of the experimental conditions across 3 laboratories, they obtained conflicting results in behavioural differences between 8 mouse strains across the 3 laboratories. The authors concluded that small divergences in the local environment of the laboratories modulated the effects of genotype on behaviour, leading to idiosyncratic results. It was suggested that these divergences were most likely due to differences between the experimenters [21]. Indeed, also other studies produced remarkably different results when the animals were tested by different experimenters [22][23][24]. Even single aspects of the experimenter identity, such as the sex of the experimenter [22] or the way an animal is handled [25,26], have been identified to significantly affect the outcome. Furthermore, the importance of the factor "experimenter" was convincingly illustrated by a meta-analysis on acute pain sensitivity, showing that the experimenter accounted for more variance than any other known factor [27]. Thus, although it is not yet understood completely, why the factor "experimenter" has such a strong influence, it represents one of the top confounding factors in animal research.
The common approach to dealing with such factors is the rigorous standardisation of the experimental conditions. Thus, in most studies, all animals have the same age, are housed under the same conditions, and are tested by the same experimenter. This way, the variation within a study is expected to be minimised, and thus, the power to detect potential treatment effects is enhanced. However, this concept relies on the assumption of a fixed treatment effect, which can be detected by eliminating all sources of variation within a study. It ignores the fact that biological variation is an inherent characteristic of animal research and treatment effects may vary depending on the exact conditions to which a study is standardised [19,28]. Although diverging findings might provide novel insights about a phenomenon under investigation, meaningful conclusions can only be drawn if the exact differences between experiments are known. Researchers that standardise their study conditions rigorously to a set of often unknown factors (i.e., noise level, personnel), however, limit the inference space of the study to these narrowly defined conditions, thereby hampering the detection of such potentially meaningful study differences. Therefore, rigorous standardisation is at risk to produce idiosyncratic results that represent local "truths" and are often not reproducible.
This fallacy of enhancing reproducibility through standardisation becomes most apparent with respect to the experimenter effect: It is not only impossible to standardise experimenters across laboratories, but fundamentally misleading as the experimenter identity usually is of no biological interest for the study question. Instead of minimising variation by rigorous standardisation, embracing biological variability might be a better strategy to address replication failure. With regard to the experimenter, this would mean that instead of trying to eliminate this uncontrollable influencing factor in science, it could be used in a controlled way to systematically introduce variation in a study [29].
A growing body of evidence suggests that introducing heterogeneity in a controlled way to the study design (referred to as "systematic heterogenisation") increases the inference space of the results, leading to better reproducibility under varying background conditions [28,[30][31][32][33][34][35]. For example, splitting an experiment into several batches (i.e., mini-experiments with slightly varying conditions) improved the reproducibility of findings across replicate experiments within the same laboratory [36]. Whereas such data provide a convincing proof-of-concept of systematic heterogenisation, empirical evidence from tests across independent laboratories is still limited [33,37,38]. Therefore, further studies are urgently needed that (1) identify potential heterogenisation factors; and (2) empirically validate such strategies in a real-life situation. A successful and versatile heterogenisation strategy thereby comprises 2 different characteristics. First, the strategy needs to introduce sufficient variation in the study design to mimic the variation that usually occurs between studies. Second, the factor used to introduce variation should not be in the focus of the study question itself (i.e., is of no biological interest). In light of the discussion summarised above, the factor "experimenter" complies with both requirements and thus represents a promising factor for a heterogenisation strategy.
Against this background, the overall aim of the present multilaboratory study was to empirically test the potential of the experimenter as a heterogenisation factor. We expected improved reproducibility of research findings in comparison to a conventionally standardised design. In line with this assumption, we also aimed at estimating the amount of variance explained by multiple experimenters, assuming that the factor "experimenter" represents the major source of variation. To this end, the same animal experiment was replicated independently in 3 different laboratories, using both a heterogenised design with 3 experimenters being involved in testing the study sample and a standardised design with one experimenter testing the study sample. To assess reproducibility, a typical animal experiment in the field of biomedical research was mimicked. More precisely, many studies examine the role of specific genes in the modulation of the phenotype and therefore rely on the phenotypic characterisation of animals of different genotypes. To reflect such an experiment with a typical "treatment under investigation", 2 inbred mouse strains (i.e., different genotypes) were tested in a range of physiological and behavioural outcome variables commonly used in such phenotyping studies (cf. [20]).

Animals and housing conditions
In this study, 96 naïve female mice of 2 inbred mouse strains (C57BL/6J and DBA/2N, 48 mice per strain) were used in each of the 3 laboratories. To ensure consistent housing conditions across the 3 laboratories, female mice were chosen because, in contrast to male mice, they can be easily housed in stable groups without taking the risk of having to separate some mice over the course of the experimental phase (for a discussion about housing male mice, see [39,40]). All animals were provided by the same commercial supplier (Charles River Laboratories). As each experimenter conducted the experiment in an independent batch, animals were delivered for all experiments separately at an age of 7 weeks (for details, see S2 Table). Upon arrival, the animals were housed in same strain groups of 2 mice per cage. All animals of one experiment (12 per strain) were housed in the same rack, beginning at the top with the strains being allocated to their horizontal and vertical rack position in a balanced way. The allocation to the rack position was the same for all experiments and was harmonised across laboratories. The animals were housed according to laboratory-specific housing protocols (for details on, e.g., cage type, bedding material, and temperature, see S1 Table). Food pellets and tap water were provided ad libitum. Cages were cleaned weekly and housing rooms were maintained at a reversed 12/12 h light-dark cycle in all 3 laboratories.

Ethics statement
All procedures complied with the regulations covering animal experimentation within Germany (Animal Welfare Act), Switzerland (Swiss Animal Welfare Ordinance

Concept of the study
The present project was designed as a multilaboratory study, involving 3 independent laboratories in Germany and Switzerland (A: Veterinary Public Health Institute, Animal Welfare Division, University of Bern, Switzerland; B: Department of Behavioural Biology, University of Münster, Germany and C: Department of Behavioural Biology, University of Osnabrück, Germany). The overall aim was to compare a conventionally standardised design and a systematically heterogenised design in a real-life situation (cf. [33]). Therefore, each of 3 laboratories tested 2 inbred mouse strains (C57BL/6J and DBA/2N) for a variety of different behavioural and physiological outcome measures.
In contrast to the standardised design, the heterogenised design included variation among different experimenters in each laboratory. In detail, in the standardised design, all mice of one experiment (n = 12 mice per strain, see "Data analysis" for details on sample size estimation) were tested by one experimenter, while in the heterogenised design, 3 different experimenters were involved in conducting the experiment (i.e., 4 mice per strain and experimenter were included in the analysis; see Fig 1A). This allocation of the experimenters to the 2 designs was randomly chosen before the study was conducted using the randomisation software "Research Randomizer" [41].
To evaluate the reproducibility of the results in both designs and the factors that contribute the most to the variance in the data, 3 different analyses were applied: (1) reproducibility of the strain effect across laboratories; (2) exploration of impact of experimenter allocation; and (3) components of variance analysis.
In the first analysis, we compared the reproducibility of the results between both designs in terms of the consistency across the 3 laboratories and the performance of the experimental designs to predict the overall effect size. This analysis was completed on the basis of the prior determined experimenter allocation. In the second analysis, we evaluated if the results of the first analysis may have been due to the specific experimenters allocated to each design rather than due to the characteristics of the designs itself. Here, the first analysis was repeated on the basis of 10 alternative experimenter allocations (see "Experimental setup in each laboratory" and "Data analysis" for details).
In a third analysis, we used the data generated to disentangle the different components of variance and estimate their contribution to the total variance in this study (see "Data analysis" for details). To examine reproducibility of research findings across laboratories, the same animal experiment was independently repeated in 3 laboratories in both a conventionally standardised (red) and a heterogenised design (blue). In the standardised design, mice of 2 strains, C57BL/6J and DBA/2N (n = 12 per strain), were tested by one experimenter, while in the heterogenised design, 3 different experimenters were involved in conducting the experiment (i.e., 4 mice per strain and experimenter were included in the analysis). Each experimenter, regardless of the design, conducted the same animal experiment following the same standardised protocols. All test procedures were conducted in the same order for all animals. (b) Experimental setup in each laboratory. In each laboratory, 4 experimenters conducted the same animal experiment as described above. All 4 experiments were conducted according to a randomised block design including 3 blocks per experiment (Block 1-3). In each block, mice were housed in the same row in the same rack (e.g., top, middle, bottom) and, thus, these animals shared the same environmental background with respect to lighting conditions, humidity, and temperature. Out of these 4 experiments per laboratory, one was randomly selected and classified as standardised experiment (e.g., Exp A, red). For the heterogenised design, one block (i.e., 4 mice per strain) from each of the remaining 3 experimenters (Exp B-Exp D) was pseudo-randomly selected and classified as part of a heterogenised experiment. This was done in a way that in each heterogenised experiment all 3 blocks were represented. Please note: Shown is one example of a possible allocation of experimenters and blocks to the standardised and heterogenised design, respectively (for details, see "Data analysis" and S3 Table).

PLOS BIOLOGY
Improving reproducibility through multiple experimenters?

Experimental setup in each laboratory
In each of the 3 participating laboratories, 4 experimenters conducted the same animal experiment (C57BL/6J versus DBA/2N mice) following the same standardised test protocols (see "Experimental procedures"). In the laboratory in Bern, all experimenters were females, whereas in Münster and Osnabrück, male and female experimenters were involved in the study. Experimenters with different levels of experience in both, handling mice and conducting the test procedures, were involved in all laboratories (for details, see S2 Table). This diversity was deliberately chosen to cover a wide range of potential influencing factors, which are considered to account for experimenter effects (e.g., sex and experience [22,23]).
Theoretically, one experimenter per laboratory had to test a "full" experiment of 12 mice per strain in the standardised design, whereas the 3 experimenters allocated to the heterogenised design only needed to test a reduced number of 4 mice per strain. However, in practice, all experimenters tested the same number of animals as needed in one conventional standardised experiment (i.e., 12 mice per strain) to guarantee that all experimenters were blind to the experimental design. As a consequence, 4 "full" independent experiments were performed by 4 experimenters consecutively in each laboratory (for an overview of the results of all 4 experimenters in each laboratory, see S5-S14 Figs and S8 Table). Each experimenter tested 12 mice per strain (n = 12), except for the laboratory in Osnabrück. Here, due to technical reasons, the sample size of 2 experimenters (Exp A and B) was reduced to n = 11 for the "C57BL/6J" group and n = 9 and n = 8, respectively, for the "DBA/2N" group. Due to the consecutive testing of the 4 experimenters, experiments in the heterogenised design comprised not only animals tested by different experimenters, but also tested at different time points, while in the standardised design, all animals from one laboratory were delivered and tested at one specific point in time by one experimenter (Fig 1). Besides of providing blindness, the approach of testing 4 "full" experiments in each laboratory offered the opportunity to simulate different combinations regarding the allocation of the experimenters to the experimental designs (see "Data analysis" and S3 Table).
All "full" experiments were organised according to a randomised block design, where each experiment was divided into 3 "blocks" containing 4 mice per strain (see Fig 1B). Mice of one block were housed in the same row in the same rack (e.g., top, middle, bottom) and thus shared the same "microenvironment" with respect to lighting conditions, humidity, and temperature. This randomised block design provided the basis for the selection process of the animals to the experimental designs. More specifically, one "full" experiment (i.e., 3 blocks, tested by one experimenter) was randomly selected and assigned to the standardised design. For the heterogenised design, one block (i.e., 4 mice per strain) from each of the 3 remaining experimenters was pseudo-randomly selected and classified as part of one heterogenised experiment. Consequently, each selected heterogenised experiment contained 3 blocks tested by 3 different experimenters.

Experimental procedures
To examine reproducibility of behavioural and physiological differences between the 2 inbred mouse strains (C57BL/6J and DBA/2N), all mice were subjected to the same testing procedures. Thereby, behavioural paradigms were chosen in accordance with established protocols for the phenotypic characterisation of mice in experimental animal research [42].
Experimental procedures started on postnatal day (PND) 70 ± 2 and lasted for 8 days. Over the course of this period, the following tests were conducted during the active phase in the same order for all animals: Elevated Plus Maze (EPM) on PND 70 ± 2, Dark Light (DL) test on PND 71 ± 2, Open Field (OF) test on PND 72 ± 2, Novel Cage (NC) test on PND 76 ± 2, and Nest (NT) test starting on PND 77 ± 2 (see Fig 1). Additionally, faecal samples to determine levels of glucocorticoid metabolites non-invasively were collected on PND 73 ± 2 and the change in body weight over the course of the test phase, from PND 69 ± 2 to PND 78 ± 2, was measured. Details on the experimental procedures are given in S1 Text. Due to technical reasons, the body weight data of 3 mice are missing.
Four days before the start of the experimental phase, habituation of the animals to the designated experimenter took place. Each experimenter handled the mice at least 3 times before the start of the first test procedure (EPM). Animals tested by one experimenter were exclusively handled by this person during the whole experimental phase.
The order of mice tested on the same day was pseudo-randomised, following 2 rules. First, mice of neighbouring cages (C57BL/6J and DBA/2) were tested consecutively. Second, in the EPM, DL, OF, and NC tests, a break of at least 1 h between the testing of 2 cage mates was ensured to minimise influences caused by any disturbances in the cage.
While the experimenters were blind with respect to the allocation of the mice to the experimental design (conventionally standardised or heterogenised design), blinding to the mouse strain was not possible due to different fur colours of the 2 strains (C57BL/6J mice: black, DBA/2N mice: brown). However, the main outcome (i.e., reproducibility of strain differences across laboratories) was unlikely to be influenced by the lack of blinding at this stage of the study. Importantly, all experimenters were unaware of the results from other experimenters in their own laboratory and in the other participating laboratories.

Data analysis
In this study, 20 outcome measures were recorded, which derived from 7 experimental test procedures. To minimise dependencies between these outcome measures, all 20 outcome measures were checked for correlations among each other. We selected 10 outcome measures (see S4 Table and S4 Fig), which had a correlation coefficient <0.5 and were therefore not highly correlated to each other (cf. [36,37]). To avoid any biases in the selection process, the whole selection process was completed by an experimenter blind to the specific outcome measures. Outcome measures for exclusion were determined in a way that as few outcome measures as possible had to be excluded in this process. Whenever only 2 outcome measures were correlated with each other, it was randomly chosen which one was excluded. The whole selection process was conducted on the basis of the full data set, irrespective of the allocation of the experimenters to the experimental designs and before the following analyses (see below) were carried out.
To estimate the sample size in our study, an a priori power analysis was conducted using previously published data by Bodden and colleagues [31]. According to this data, we expected large effect sizes for the primary outcome of interest: the strain-by-laboratory interaction. With a sample size of n = 12 mice per strain and laboratory, we could ensure to detect biologically relevant variations of the strain effect across the 3 laboratories with a power of 80%.
Reproducibility of the strain effect across laboratories. The main analysis to evaluate the reproducibility of the treatment effect across the 3 laboratories and compare it between both experimental designs was adapted from von Kortzfleisch and colleagues [36] and comprised the following 2 approaches: (I.) calculating the consistency of the strain effect across laboratories; and (II.) estimating how often and how accurately the overall effect was predicted in the laboratories.
(I.) Consistency of the strain effect across laboratories. The consistency of the strain effect across laboratories is statistically reflected in the variation captured by the interaction between the factors "strain" and "laboratory" (i.e., "strain-by-laboratory" interaction). To assess this "strain-by-laboratory" interaction term as a measurement for reproducibility, a linear mixed model (LMM) was applied to both designs (standardised and heterogenised). Details on this analysis and the model equation are presented in S2 Text and S5 Table. (II.) Estimation of how often and how accurately the overall effect is predicted in the laboratories. How good each experimental design predicted the overall effect size was assessed by the following 2 measurements that were adapted from the analysis by Voelkl and colleagues [32]: The coverage probability (Pc) and the proportion of accurate results (Pa). The Pc was assessed by counting how often the effect size estimates from the 3 laboratories and their corresponding confidence intervals (CI 95 ) cover the overall effect size. The Pa was determined by counting how often the standardised and heterogenised experiments in the different laboratories predicted the overall effect size accurately with respect to its statistical significance. This was done for each experimental design separately. For details on this analysis, see S2 Text.
Impact of experimenter allocation. Regarding the allocation of the experimenters to the 2 experimental designs in each laboratory, several combinations were possible. It might be plausible that the results are dependent on the specific experimenters selected for each experimental design. Therefore, to gain more confidence in the conclusion when comparing both experimental designs, we repeated the previously outlined analyses (I. and II.) for 10 randomly chosen, alternative allocations of the experimenters to the experimental designs (for details, see S3 Table).
Components of variance. To assess the influence of the different laboratories and experimenters, we conducted a follow-up analysis, where the data of all mice tested by each experimenter in all 3 laboratories were combined and the proportion of variation due to different sources was calculated. In detail, the total amount of variation in this dataset can be separated into the between-strain variability, between-laboratory variability, between-experimenter variability, strain-by-laboratory interaction variability, strain-by-experimenter interaction variability, between-block variability, strain-by-block interaction variability, between-cage variability, and between-individual variability (residuals). Therefore, the proportion of variance attributable to each factor was estimated using an LMM (see S2 Text).
Nearly all statistical analyses were conducted and graphs created using the statistical software "R" [43] (Version 4.0.2). Only testing for the correlation among outcome measures was done using the statistical software "IBM SPSS Statistics" (IBM Version 23), and the power calculation was done using the statistical software G � Power [44]. Differences were considered to be statistically significant when p � 0.05.

Reproducibility of the strain effect across laboratories
Descriptively, some of the detected strain effects in this study could be well reproduced at all 3 sites, but for others, we found remarkably different results between the 3 laboratories (Fig 2). For example, all 3 laboratories detected a significant strain effect (i.e., CI 95 intervals distinct from 0) regarding the time spent in the light compartment in the DL with C57BL/6J mice spending more time in the light compartment than DBA/2N mice (Fig 2D). However, for half of all outcome measures, the effect sizes varied across the 3 laboratories (i.e., nonoverlapping CI 95 intervals between laboratories) with some laboratories detecting a significant strain effect and some not (e.g., "OF centre time"; Fig 2E). Most interestingly, completely contradicting conclusions were found regarding the number of "rearings" in the NC in both designs ( Fig  2G), reflecting an example of severely hampered reproducibility. Whereas in Lab A, C57BL/6J mice reared less often than DBA/2N mice; in Lab C, C57BL/6J mice were characterised by higher numbers of "rearing" behaviour than DBA/2N. By contrast, Lab B found no difference between the 2 mouse strains. Thus, with respect to this outcome measure, we found 3 different conclusions (i.e., lower levels, higher levels, and no difference) about the strain effect in 3 different laboratories.
To statistically compare the extent of reproducibility of the strain effect among the different laboratories between the standardised and the heterogenised design, 2 different approaches were used: First, we calculated the consistency of the strain effects across laboratories and second, we estimated how often (Pc) and how accurately (Pa) the results from each laboratory predicted the overall effect. Concerning the consistency of findings across laboratories, the pvalues of the "strain-by-laboratory" interaction term did not differ significantly between the 2 designs (Fig 3A; Wilcoxon signed-rank test (paired, one-tailed, n = 10): V = 22, pvalue = 0.31). Likewise, we could not detect any significant difference between the 2 designs in the performance to predict the overall effect (Fig 3B and 3C; Wilcoxon signed-rank test (paired, one-tailed, n = 10): Pc: V = 6, p-value = 0.39; Pa: V = 1.5, p-value = 0.14).

Impact of experimenter allocation
The allocation of the experimenters to the experimental designs was randomly chosen. Thus, several different combinations of experimenters selected in each laboratory and allocated to each design are theoretically possible. To check whether alternative allocations of the experimenters would have altered the results when comparing both experimental designs, we repeated the analyses on basis of 10 additionally, randomly selected, alternative allocations (for details, see S3 Table).
Overall, in accordance with the initial finding, no significant differences could be detected regarding the reproducibility of the results between the experimental designs. More specifically, when examining the consistency of the strain effect of all 10 alternative allocations, one alternative allocation led to a significantly improved consistency of the strain effect in the heterogenised design (Wilcoxon signed-rank test (paired, one-tailed, n = 10): V = 9, pvalue = 0.03). However, for the 9 remaining allocations, no significant differences between the experimental designs could be detected (Wilcoxon signed-rank test, paired, one-tailed, n = 10; for details, see S1 Fig and S6 Table). Likewise, with respect to the Pc and Pa ratios, no significant differences between both experimental designs could be found (Wilcoxon signed-rank test, paired, one-tailed, n = 10; for details, see S2 and S3 Figs and S6 Table).

Components of variance
In addition, the full dataset was analysed to estimate the influence of the different factors on the variance of the data (for a summary, please see Fig 4). Although the specific amount of variation attributable to these different factors could only be estimated with some uncertainty, an overall pattern emerged. For 9 out of 10 outcome measures, the factor "laboratory" (main effect (blue) and interaction effect with strain (dark blue)) accounted for more variation than the factor "experimenter" (main effect (red) and interaction effect with strain (orange)). More precisely, the "laboratory" explained on average (median) 25% of the variation in the data (with a range from 0%, CI 95 [0%, 17%] to 64%, CI 95 [5%, 100%]), whereas on average, only 5% (median, with a range from 0%, CI 95 [0%, 13%] to 27%, CI 95 [2%, 97%]) of the total variance could be assigned to the factor "experimenter" (for details, please see S7 Table, which gives an overview of the variation explained by each factor including uncertainty measures (i.e., CI 95 intervals) for all outcome measures). In addition, the proportion of variance that could not be assigned to any known source (= between-individual variability or residual variance, grey) accounted for 47% of the total variance (median, with a range from 10%, CI 95 [8%, 13%] to 72%, CI 95 [58%, 88%]). More specifically, in 6 out of 10 outcome measures, the residual

Discussion
The aim of this study was to empirically compare an experimenter heterogenisation strategy with a conventionally standardised design in a multilaboratory study. In contrast to our expectations, the experimenter heterogenisation led neither to more consistency across the 3 laboratories nor to a better performance in detecting the overall treatment effect compared to a conventionally standardised design. This finding was independent of the concrete constellation of experimenters contributing to the standardised and the heterogenised designs, as it remained the same when basing the analysis on 10 alternative, randomised allocations.
The reasons for these unexpected findings might be 2-fold: (1) either both experimental designs were characterised by "perfect" reproducibility across the 3 laboratories, creating a ceiling effect; or (2) the variation introduced by the different experimenters in the heterogenised design was not large enough to mimic the variation between laboratories.
Concerning the first point, the reproducibility of the results across the 3 laboratories was far from being "perfect". Although, for some outcome measures, all 3 laboratories detected a significant strain effect of the same direction (e.g., "light compartment time in the DL" or "faecal corticosterone metabolites"); for half of all outcome measures, the conclusions from the single laboratories regarding the detected strain effect differed profoundly. The inconsistencies between laboratories ranged from variation in effect size to variation in the direction of the effects, including a case of significant effects in opposite directions ("rearings in the NC").
Regarding the second point, although the experimenter identity is known as one of the top influencing factors in many fields of animal research [22,23,25], the systematic heterogenisation of this factor in the present study was not sufficient to improve the reproducibility of results across laboratories. Theoretically, the efficiency of a heterogenisation strategy is dependent on the amount of variation that is introduced by this specific factor. The experimenter heterogenisation in our study was realised on the basis of a multibatch design, meaning that For each outcome measure, the total variance of the full dataset could be decomposed into the following sources using an LMM: between-strain variability (yellow), betweenlaboratory variability (blue), between-experimenter variability (red), strain-by-laboratory interaction variability (dark blue), strain-by-experimenter interaction variability (orange), between-block variability (dark green), strain-by-block interaction variability (light green), between-cage variability (beige), and between-individual variability (residuals, grey). Shown are point estimates of the proportion of variation explained by each factor. For details on 95% confidence intervals of these estimates, see S7

PLOS BIOLOGY
Improving reproducibility through multiple experimenters? each experimenter tested the animals in a different batch. This way, the introduced variation relied not only on the experimenter identities (e.g., sex, experience in handling mice), but it also comprised the environmental conditions under which the dedicated experimenter tested the animals (e.g., temperature, noise level). Including variation by splitting an experiment into several smaller ones spread over time (i.e., different batches) has already been shown to improve the reproducibility of results [34,36]. Hence, such an approach is particularly promising in exacerbating the experimenter effects. However, in our study, the factor "experimenter" accounted for on average only 5% of the observed variation in the data (median, with a range from 0%, CI 95 [0%, 13%] to 27%, CI 95 [2%, 97%]), indicating a rather small to moderate influence on outcome measures (for a detailed discussion on the experimenter effect in each laboratory, see S3 Text). Moreover, in 9 out of 10 outcome measures, the variation attributable to this factor appeared to be smaller than the variation introduced by the 3 laboratories. Although these explained variances are estimated with some uncertainty, it is likely that the variation introduced by multiple experimenters was therefore not sufficient enough to cover the variation that inevitably exists between different laboratories. Whether this finding is limited to the exemplarily chosen treatment effect in this study (i.e., physiological and behavioural mouse strain differences) or whether an experimenter heterogenisation might be more beneficial in studies involving extensive handling of the animals by the experimenter (cf. [23,27]) remains to be tested.
A glance at the growing body of heterogenisation literature demonstrates good evidence that heterogenisation strategies for single-laboratory studies in general help to render the results more generalisable and improve the reproducibility of the results across replicate experiments compared to standardised designs [30,31,33,34,36,37]. Interestingly, so far, this only holds true if the replication is done in the same laboratory, but not necessarily if the experiment is replicated at different locations [33]. Likewise, improved reproducibility was observed in a setting where environmental conditions between different laboratories did not differ largely [37]. Together, this argues for heterogenisation strategies in single-laboratory studies to have the potential to increase the inference space of the results and thereby to enhance the generalisability of the conclusions (i.e., across different experimenter identities or over time). At the same time, however, they may not broaden the inference space sufficiently to cover disparate laboratory environments and thus guarantee reproducibility across laboratories.
Recently, a database-driven simulation of a multilaboratory approach showed that a "laboratory heterogenisation", including 2 to 4 laboratories in one study, rendered the results more robust and reproducible across replicates in different laboratories [32]. By using the laboratory as a heterogenisation factor, many different factors (e.g., light-dark cycle, housing conditions, strain of the animals) were systematically varied at the same time. Therefore, the laboratory served as kind of an "umbrella factor" for a set of known and unknown background factors, which was highly efficient in reflecting the "real-life" variation. However, the implementation of such a multilaboratory heterogenisation strategy in practice is highly demanding and might not be easily applicable to many studies.
Concerning sources of variation, the results of our study point to a further, so far widely overlooked aspect. Although the data were collected across different laboratories and experimenters, in 6 out of 10 outcome measures, our analysis suggests that a great proportion of variation (41%, CI 95 [34%, 49%] to 72%, CI 95 [58%, 88%] could not be explained by these factors or other often discussed sources of variance (e.g., cage). Instead, they were attributed to interindividual differences of unknown sources (residuals). Whether this observation only holds true for the here examined physiological and behavioural differences between inbred mouse strains or can be applied to other settings in animal research (e.g., to nonbehavioural readouts and interventions) needs to be further examined. Yet, our findings are in line with Kafkafi and colleagues [45], who also reported interindividual differences between mice to account for a high proportion of variation (up to 75%). Therefore, these findings may serve as an impressive example of how much biological variation can exist despite strict standardisation regimes. This is particularly alarming for all scientists conducting animal studies as standardisation is still considered as the "gold standard" to create controlled and homogeneous conditions in animal experiments and, thereby, to minimise random variation in the data (i.e., "noise"). It is likely that this high amount of unexplained variation emerges due to complex interactions between known but also unknown factors we are not even aware of [46]. Therefore, the identification of "umbrella factors" encompassing these known and unknown background factors might present a promising solution to integrate such uncontrolled variation systematically into the study design [38]. Thus, instead of trying to understand and eliminate all sources of unexplained variation in animal studies, future studies should aim at developing more efficient strategies to embrace such heterogeneity in study populations and, thereby, to improve generalisability and reproducibility of research findings.  Table. Allocations of the experimenters to the standardised and heterogenised design. Listed are the randomly chosen combination for the main analysis (Combination 0) and 10 alternative combinations. In each lab, the allocation process consisted of the following steps. First, all mice tested by one randomly chosen experimenter (Exp A-Exp D) were assigned to be part of the standardised design. Second, the remaining 3 experimenters in each lab were assigned to be part of the heterogenised design and one block of mice (B1 -B3) from each experimenter was pseudo-randomly selected for the heterogenised experiment in each lab. This was done in a way that in each heterogenised experiment, all 3 blocks were represented. Please note, due to the loss of animals, the datasets generated by Exp A and B in Lab C were only used for the heterogenised design in a restricted way. In detail, block B3 from Exp A and B2 from Exp B were excluded from the selection process.